8 research outputs found

LOD-Connected Offensive Language Ontology and Tagset Enrichment

Author: Bączkowska Anna
Lewandowska-Tomaszczyk Barbara
Liebeskind Chaya
Mitrović Jelena
Valunaite Oleskeviciene Giedre
Žitnik Slavko
Publication date: 01/01/2021

CC BY 4.0

The main focus of the paper is the definitional revision and enrichment of offensive language typology, making reference to publicly available offensive language datasets and testing them on available pretrained lexical embedding systems. We review over 60 available corpora and compare the tagging schemas applied there, while attempting to explain semantic differences between particular concepts of the category OFFENSIVE in English. A finite set of classes that cover aspects of offensive language representation, along with linguistically sound explanations, is presented, based on the categories originally proposed by Zampieri et al. [1, 2] in terms of offensive language categorization schemata and tested by means of Sketch Engine tools on a large web-based corpus. The schemata are juxtaposed and discussed with reference to the non-contextual word embeddings FastText, Word2Vec, and GloVe. A methodology for mapping existing corpora to a unified ontology, as presented in this paper, is provided. The proposed schema will enable further comparable research and effective use of corpora of languages other than English. It will also be applied in building an enriched tagset to be trained and used on new data, with the application of recently developed LLOD techniques [3].

Mykolas Romeris University Institutional Repository

A survey of guidelines and best practices for the generation, interlinking, publication, and validation of linguistic linked data

Author: Chiarcos Christian
Declerck Thierry
Di Buono Maria Pia
Dojchinovski Milan
Gifu Daniela
Gracia Jorge
Khan Fahad
Valunaite Oleskeviciene Giedre
Publication date: 24/04/2023

This article discusses a survey carried out within the NexusLinguarum COST Action which aimed to give an overview of existing guidelines (GLs) and best practices (BPs) in linguistic linked data (LLD). In particular, it focused on four core tasks in the production and publication of linked data: generation, interlinking, publication, and validation. We discuss the importance of GLs and BPs for LLD before describing the survey and its results in full. Finally, we offer a number of directions for future work in order to address the findings of the survey.

OPUS Augsburg

A Survey of Guidelines and Best Practices for the Generation, Interlinking, Publication, and Validation of Linguistic Linked Data

Author: Anas Fahad Khan
Christian Chiarcos
Daniela Gifu
di Buono Maria Pia
Giedre Valunaite Oleskeviciene
Jorge Gracia
Milan Dojchinovski
Thierry Declerck
Publication venue: European Language Resources Association (ELRA)
Publication date: 01/01/2022

Università degli Studi di Napoli L'Orientale: CINECA IRIS

An OWL ontology for ISO-based discourse marker annotation

Author: Apostol Elena-Simona
Baczkowska Anna
Chiarcos Christian
Damova Mariana
Liebeskind Chaya
Silvano Maria da Purificação
Trajanov Dimitar
Truica Ciprian-Octavian
Valunaite-Oleskeviciene Giedre
Publication date: 01/01/2022

Purpose: Discourse markers are linguistic cues that indicate how an utterance relates to the discourse context and what role it plays in conversation. The authors are preparing an annotated corpus in nine languages, and specifically aim to explore the role of Linguistic Linked Open Data (LLOD) technologies in the process, i.e., the application of web standards such as RDF and the Web Ontology Language (OWL) for publishing and integrating data. We demonstrate the advantages of this approach.

Repositório Aberto da Universidade do Porto

Mykolas Romeris University Institutional Repository

Validation of language agnostic models for discourse marker detection

Author: Apostol Elena-Simona
Baczkowska Anna
Chiarcos Christian
Damova Mariana
Liebeskind Chaya
Mishev Kostadin
Oleskeviciene Giedre Valunaite
Silvano Maria da Purificação
Trajanov Dimitar
Truica Ciprian-Octavian
Publication date: 01/01/2023

Using language models to detect or predict the presence of language phenomena in text has become a mainstream research topic. With the rise of generative models, experiments using deep learning and transformer models have triggered intense interest. Aspects such as precision of predictions, portability to other languages or phenomena, and scale have been central to the research community. Discourse markers, as language phenomena, perform important functions, such as signposting, signalling, and rephrasing, by facilitating discourse organization. Our paper is about discourse marker detection, a complex task as it pertains to a language phenomenon manifested by expressions that can occur as content words in some contexts and as discourse markers in others. We have adopted a language-agnostic model trained on English to predict the presence of discourse markers in texts in 8 other languages unseen by the model, with the goal of evaluating how well the model performs on languages with different structural and lexical properties. We report on the process of evaluating and validating the model's performance across European Portuguese, Hebrew, German, Polish, Romanian, Bulgarian, Macedonian, and Lithuanian, and on the results of this validation. This research is a key step towards multilingual language processing.

Repositório Aberto da Universidade do Porto

TED-ELH Parallel Corpus (ELEXIS)

Author: Liebeskind Chaya
Valunaite Oleskeviciene Giedre
Publication venue: Jerusalem College of Technology
Publication date: 12/02/2021

The corpus contains aligned parallel scripts of TED Talks in English, Lithuanian, and Hebrew. It contains spoken language data. See also: http://hdl.handle.net/20.500.11821/3

Common Language Resources and Technology Infrastructure - Slovenia

A Survey of Guidelines and Best Practices for the Generation, Interlinking, Publication, and Validation of Linguistic Linked Data

Author: Anas Fahad Khan
Christian Chiarcos
Daniela Gifu
Giedre Valunaite Oleskeviciene
Jorge Gracia
Maria Pia di Buono
Milan Dojchinovski
Thierry Declerck
Publication venue: European Language Resources Association (ELRA)
Publication date: 01/01/2022

ARCHIVIO ISTITUZIONALE DELLA RICERCA-UNIVERSITA' DEGLI STUDI DI NAPOLI "L'ORIENTALE"

Balancing the digital presence of languages in and for technological development: a policy brief on the inclusion of data of under-resourced languages into the linked data cloud

Author: Bosque-Gil Julia
Chiarcos Christian
Declerck Thierry
Dojchinovski M.
Gracia Jorge
Ionov Maxim
Mititelu Verginica Barbu
Oliveira Hugo Gonçalo
Rychkova Liudmila
Valunaite Oleskeviciene Giedre
Publication date: 24/04/2023

OPUS Augsburg